Hybrid metagenome assemblies link carbohydrate structure with function in the human gut microbiome

Complex carbohydrates that escape small intestinal digestion are broken down in the large intestine by enzymes encoded by the gut microbiome. This is a symbiotic relationship between microbes and host, resulting in metabolic products that influence host health and are exploited by other microbes. However, the role of carbohydrate structure in directing microbiota community composition and the succession of carbohydrate-degrading microbes is not fully understood. In this study we evaluate species-level compositional variation within a single microbiome in response to six structurally distinct carbohydrates in a controlled model gut using hybrid metagenome assemblies. We identified 509 high-quality metagenome-assembled genomes (MAGs) belonging to ten bacterial classes and 28 bacterial families. Bacterial species identified as carrying genes encoding starch binding modules increased in abundance in response to starches. The use of hybrid metagenomics has allowed identification of several uncultured species with the functional potential to degrade starch substrates for future study.

Prior to publication, the paper should be improved in two ways: the writing of the results section must be tidied up, and the authors should push harder to take advantage of their unique and high-quality MAGs. These are summarized in the major comments.
Major comments **YOU CANNOT PRESENT NEW RESULTS IN THE DISCUSSION.** The results, discussion, and methods must be revised so that all new results are reported in the results section. The discussion should place the results in a broader context of other studies, and should not reiterate the results beyond what is needed to set this up. For example, on line 275 you state "We found the greatest number and diversity of CAZyme genes were in the genomes of Bacteroidetes," but this was not reported in the text of the results, though it may have been buried in a figure or table. If a finding is important enough to re-state in the discussion, it should be written out in the results.
Overall, the results section of this manuscript reads more like a methods/results, and the discussion section reads more like a results/discussion. Methods that are not new/innovative or are not directly relevant to reporting the results should be confined to the methods section. Your paper would benefit greatly by expanding the later sections of the results, carefully stepping through the findings from your MAGs. This is the major payoff of your work, and it gets short shrift in your results! Expansion of E. coli in fecal samples does NOT indicate contamination (Line 126). It is likely an important result that warrants reporting. You repeated it across 3 time points, though you have no replicates. I appreciate you running a negative control sample. It seems like a lot of your negative control consists of back-contamination from your real samples, as evidenced by the high abundance of P. copri. Real lab contamination with no crossover from high-abundance samples is often revealed by taxa such as Ralstonia, Pelomonas, Bradyrhizobium, P. acnes, and other taxa that would almost never be found in feces.
I know you are limited from a small sample size, but can you push a little harder to establish the link between the CAZyme content and response to substrates? Do genomes with similar CAZymes respond similarly to different substrates? Does this hold for genomes in different phyla, which might have a similar substrate response?
Can you say anything about how the carbohydrate-active genes are clustered into operons across the various genomes? It seems a shame to build these nice long contigs and then not use them for an analysis of which genes are close to each other. People who do Illumina-only sequencing are normally not able to assemble contigs long enough to reliably capture whole operons, and will be jealous that you are able to carry out such an insightful analysis.
How does the CAZyme composition for each genus correspond to the composition in reference genomes from NCBI? Do your results reflect what is already known about the genera, or have you found things that are totally new?
The authors propose a bunch of new candidate species names. Do you have a record of prior publications that would indicate that such suggestions are adopted by the wider community of scientists? If not, I suggest you forego the introduction of candidate names. My worry is that the candidate names will muddy the taxonomic waters, and will not be adopted as real species names anyway, when the day comes that these bacteria are cultured and deposited. I am more than willing to defer to the suggestion of the editor on this issue.
I have a number of minor comments that will improve the presentation of results.
(Line 104) The study diagram in Figure 1 is very nice for giving readers an overview of the study. However, presenting all the details of the analytical workflow, including software used, may be too much for a main figure. Please consider moving some of these details into a supplemental figure to increase the impact of the main figure.
(Line 122, 149) The text says, "Error! Reference source not found." Not sure what that's about. The heatmap in Figure 2 works well as an overview of the taxonomic composition, but does a poor job of displaying which species increase with each carbohydrate source. I strongly suggest adding a new panel to show which species increase the most for each carbohydrate source. Adding this panel will support your presentation of the results in the text. Suggest moving Figure 3 to supplemental. These quality metrics are not surprising, but will be of interest to readers who are hoping to do something like this in their own lab. Use your main figures to show off the MAGs and the CAZyme composition.
(Line 138) You can add statistical support for this statement by running a PERMDISP test to show that the distance among samples on day 0 was much less than the distance among samples on subsequent days. This test is implemented as betadisper() in the vegan package for R. You can also manually collect the distances among samples within each day and run a linear model/ANOVA to show that the distances are smaller on day 0.
Adding a color legend to the chart would vastly improve Figure 4. Also, if the regions of the bar charts correspond to the scatter plot, how come I see a lot of orange in the bar charts but no orange points in the scatter plots? Figure 5. It is jarring to see a phylogenetic tree that spans multiple phyla, but is rooted in the middle of the Clostridia. We know from many other studies that the true root should place the phyla into separate clades. You probably set the root to the midpoint? Suggest either (1) adding an archaeal genome as an outgroup to find an appropriate root or (2) manually setting the root to a location that places the phyla into separate clades (will be a few nodes up the tree from Bifidobacteria).
Reviewer #2 (Remarks to the Author): In the manuscript, "Linking carbohydrate structure with function in the human gut microbiome using hybrid metagenome assemblies", the authors have performed fermentation studies with human stool samples in the presence of different types of complex sugars. While this is an excellent study, my major concern is that the authors' claims about the changes of gut bacteria abundances over time seem to be performed in only one sample.

Major Concern
Since we know that there are several different microbial compositions of a "healthy gut", it makes this author wonder if these results are repeatable with stool samples from different donors. I would suggest that the authors consider either altering their paper to focus on the new species and genera observed and their gene compositions in relation to the treatment OR repeat this experiment in 2 additional donors.

Minor Concern
The authors have missing references on lines 122 and 149.
Ravi et al. use in vitro methods to characterize the microbiome response to various carbohydrate sources. This is an interesting study that will be helpful to researchers in the microbiome space as they seek to build a deeper understanding of carbohydrate utilization by gut bacteria. The sequencing and bioinformatics are carried out very nicely, allowing the authors to produce high-quality metagenome-assembled genomes and make some new observations about CAZymes and growth on different substrates.
The study is primarily limited by a lack of replicates, which precludes a statistical analysis of the results. However, the authors should be granted leeway here, because running more replicates now would likely increase the costs beyond the available funds. In future studies, I would recommend that the authors run more replicates, decrease the sequencing depth to keep costs down, then pool the replicates for assembly. Such an approach would allow them to add a statistical analysis and increase the impact of their paper, without a huge cost increase.
Prior to publication, the paper should be improved in two ways: the writing of the results section must be tidied up, and the authors should push harder to take advantage of their unique and high-quality MAGs. These are summarized in the major comments.
We would like to thank the reviewer for their positive assessment of our work and their detailed and constructive comments. We provide answers to the individual comments point-by-point below. We have extensively revised the manuscript in response to the reviewers' comments, moving substantial sections that mainly describe methodological approaches from the results to the methods. We have moved any new findings from the discussion to the results, and we have extended the description of the MAGs. As these changes are extensive, we have not included them in detail here; they are highlighted in red in the manuscript.

Expansion of E. coli in fecal samples does NOT indicate contamination (Line 126). It is likely an important result that warrants reporting. You repeated it across 3 time points, though you have no replicates. I appreciate you running a negative control sample. It seems like a lot of your negative control consists of back-contamination from your real samples, as evidenced by the high abundance of P. copri. Real lab contamination with no crossover from high-abundance samples is often revealed by taxa such as Ralstonia, Pelomonas, Bradyrhizobium, P. acnes, and other taxa that would almost never be found in feces.
We agree that the expansion of E. coli need not indicate contamination, since E. coli was not identified at Time 0 for normal maize, nor was it identified in the negative control. Therefore, we have reinstated normal maize in our analyses, results and discussion. In particular, we have included the n.maize sample in Figure 4 and in Supplementary Figures 2, 3.
We have further established a link between the MAGs and starch treatment by two routes: taxonomy of the MAGs and CAZyme content.
Using the taxonomy of the MAGs, we identified several MAGs belonging to species previously reported as degraders of the different carbohydrate treatments, such as the well-known starch-degrading species Ruminococcus bromii. R. bromii was identified in the most recalcitrant starch treatments, i.e. the Hylon, potato and r.maize treatments. Bifidobacterium species were identified as increasing in the maize starch treatments (r.maize, n.maize and Hylon). Previous studies have characterised Bifidobacterium as a starch-degrading genus. The only Bifidobacterium species to increase in abundance in response to Hylon was B. adolescentis, which is known to utilise this hard-to-digest starch better than other Bifidobacterium species; a broader range of Bifidobacterium species (B. animalis, B. catenulatum, B. adolescentis, B. longum) increased in abundance in response to the more accessible r.maize and n.maize substrates, suggesting these species may be better adapted to more accessible starches. In addition, in our analyses, Bacteroides uniformis was identified as increasing from 12h during inulin fermentation and has previously been characterised as an inulin-degrading species. Faecalibacterium prausnitzii also increased in abundance with inulin supplementation and has been shown to degrade inulin when co-cultured with primary degrading species.
CAZyme and PUL content in the MAGs was used to identify novel degraders and to confirm previously identified starch degraders. Collinsella aerofaciens_J (cluster 29_1) and Candidatus Minthovivens enterohominis (cluster 81_1) are novel genomes that showed a 2x log-fold increase in the presence of inulin and harboured multiple copies of inulinases (GH32). Bacteroides uniformis, which was identified during inulin fermentation, had three copies of the GH32 (inulinase) gene and a gene encoding the inulin-binding domain, CBM38. The GH32 and CBM38 genes present in this MAG were found to be organised into a single PUL in the genome of Bacteroides uniformis with the organisation GH32-GH32;CBM38-unk-GH32-GH32-susD-susC, providing further evidence that these genes are likely to be involved in inulin degradation. In the potato treatment, a less well characterised Ruminococcus species, Candidatus Ruminococcus anthropi, with ten GH13 genes and one CBM48 gene, was identified. A previously uncultured Blautia species, Candidatus Blautia hennigii, possessing eight GH13 and three CBM48 genes, increased in abundance in response to Hylon and potato. We also identified four further previously uncharacterised species that increased in abundance and had more than five GH13 genes:
These findings are described in the results section at Lines 188 to 211 and Lines 245 to 264.

4. Can you say anything about how the carbohydrate-active genes are clustered into operons across the various genomes? It seems a shame to build these nice long contigs and then not use them for an analysis of which genes are close to each other. People who do Illumina-only sequencing are normally not able to assemble contigs long enough to reliably capture whole operons, and will be jealous that you are able to carry out such an insightful analysis.
We have extended our analysis of the CAZyme content of our MAGs to include the organisation of CAZymes into Polysaccharide Utilisation Loci (PULs), reflecting the ability of our approach to capture complete gene operons. We have inserted a new supplementary table (Supplementary Table 12) listing the PULs identified in our MAGs. We have included the following text describing these results:
Line 220: We further analysed the genome organisation of the CAZymes identified by dbCAN2 using the tool PULpy[30], which identified Polysaccharide Utilisation Loci (PULs) in a total of 21 MAGs (Supplementary Table 12). All the PULs identified were within the phylum Bacteroidetes, and the largest number of PULs identified within a single MAG was 79, found in Butyricimonas faecihominis. These numbers of identified PULs are comparable to those in other studies published using the same tool [17].
Line 247: The GH32 and CBM38 genes present in this MAG were found to be organised into a single PUL in the genome of Bacteroides uniformis with the organisation GH32-GH32;CBM38-unk-GH32-GH32-susD-susC, providing further evidence that these genes are likely to be involved in inulin degradation.

How does the CAZyme composition for each genus correspond to the composition in reference genomes from NCBI? Do your results reflect what is already known about the genera, or have you found things that are totally new?
We have carried out a direct comparison of the CAZyme content and composition of the MAGs in our study with the nearest reference genome available in NCBI. This was done for the 37 MAGs that increased in abundance in response to different substrates, which are the most relevant MAGs for the paper. Focussing on GH family genes, the overall gene count was generally similar between the MAGs and the reference genomes. The GH count per genome was higher in the MAG for 9 comparisons, higher in the NCBI reference genome for 27 comparisons, and identical between MAG and NCBI reference genome in one comparison.
We note that several of the genomes for which higher GH counts were observed in the MAGs from our study compared to NCBI reference genomes were cases where the reference genomes were obtained from environmental samples rather than isolates. This suggests that for MAGs from species that have not been isolated, the approach in this paper can yield novel information. We also note that differences in GH gene abundance between the MAGs from our study and NCBI reference genomes may reflect strain level differences in GH gene content in genomes.
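The MAG-versus-reference comparison described above amounts to a three-way tally of GH gene counts. The following is a minimal sketch of such a tally, not the authors' actual procedure; the function name, the dictionary layout, and the example counts are all hypothetical placeholders.

```python
from collections import Counter


def compare_gh_counts(pairs):
    """Tally whether each MAG has more, fewer, or the same number of GH genes
    as its nearest reference genome.

    `pairs` maps a MAG name to (GH count in MAG, GH count in reference).
    """
    tally = Counter()
    for mag_count, ref_count in pairs.values():
        if mag_count > ref_count:
            tally["higher_in_mag"] += 1
        elif mag_count < ref_count:
            tally["higher_in_reference"] += 1
        else:
            tally["identical"] += 1
    return tally


# Hypothetical counts for illustration only; these are not data from the study.
pairs = {"mag_a": (12, 10), "mag_b": (8, 15), "mag_c": (20, 20)}
tally = compare_gh_counts(pairs)
```

A tally of this kind says nothing about which GH families differ, so strain-level differences of the sort noted above would need a per-family comparison.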
We have not included this information in the paper as it does not add significantly to the overall story, but we include it below for reference.

I have a number of minor comments that will improve the presentation of results.

(Line 104) The study diagram in Figure 1 is very nice for giving readers an overview of the study. However, presenting all the details of the analytical workflow, including software used, may be too much for a main figure. Please consider moving some of these details into a supplemental figure to increase the impact of the main figure.
We have followed the reviewer's suggestion. Figure 1 has now been streamlined. We have included an additional Supplementary figure 1 with the full details of the workflow.

(Line 122, 149) The text says, "Error! Reference source not found." Not sure what that's about.
The error at line 122 and line 149 is a formatting error when converting figure hyperlinks to PDF. The hyperlink to figure 2 has now been removed, so Figure 2 is now correctly referred to in the text.
9. Figure 2 is not referenced in the text.
We thank the reviewer for this suggestion. To clearly indicate species that increase or decrease in response to a particular treatment, we calculated fold changes between the abundance of each species at Time 0 and its abundance at the other time points. This provided a clear overview of the species that increase the most for each carbohydrate source. The fold changes were converted to log ratios and plotted as a heatmap (Supplementary figure 2).

This Supplementary figure is referenced in the manuscript at Line 128.
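The fold-change calculation behind the heatmap can be sketched as follows. This is an illustrative sketch only: the response does not state the log base or how zero abundances were handled, so the base-2 logarithm, the pseudocount, the function name, and the example abundances are all assumptions.

```python
import math


def log2_fold_changes(abundance_by_time, baseline="0h", pseudocount=1e-6):
    """Return log2(abundance at t / abundance at baseline) for each later time point.

    The pseudocount (an assumption, not from the manuscript) avoids division
    by zero when a species is absent at the baseline.
    """
    base = abundance_by_time[baseline] + pseudocount
    return {
        time: math.log2((value + pseudocount) / base)
        for time, value in abundance_by_time.items()
        if time != baseline
    }


# Hypothetical relative abundances of one species under one substrate.
example = {"0h": 0.01, "6h": 0.02, "12h": 0.04, "24h": 0.08}
changes = log2_fold_changes(example)
```

Positive values indicate an increase relative to Time 0 and negative values a decrease, which is what makes a diverging colour scale natural for the heatmap.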
11. Suggest moving Figure 3 to supplemental. These quality metrics are not surprising, but will be of interest to readers who are hoping to do something like this in their own lab. Use your main figures to show off the MAGs and the CAZyme composition.
Thank you for the suggestion. This has been done, and the figure is now shown as Supplementary figure 6.

(Line 138) You can add statistical support for this statement by running a PERMDISP test to show that the distance among samples on day 0 was much less than the distance among samples on subsequent days. This test is implemented as betadisper() in the vegan package for R. You can also manually collect the distances among samples within each day and run a linear model/ANOVA to show that the distances are smaller on day 0.
We thank the reviewer for this suggestion. We ran the PERMDISP test using the betadisper function, which allowed us to calculate the distances between communities at each time point across all treatments. The distance among treatments at Time 0 was close to zero, while the distances at 6h, 12h and 24h were greater. In addition, an ANOVA was run on the distances between the time points. The change in microbiome diversity from Time 0 to 6h, 12h and 24h was significant, with the largest distance between Time 0 and 24h.
A boxplot displaying the distances is shown in Supplementary figure 5. The results of the ANOVA test are referenced in the text at Lines 147-151.
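The reviewer's alternative suggestion, collecting the distances among samples within each time point, can be illustrated with a short sketch. This is not a reimplementation of betadisper(), which measures distances to group centroids in ordination space; it simply gathers pairwise Bray-Curtis dissimilarities within each group. The choice of Bray-Curtis, the function names, and the example profiles are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors of equal length."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def within_group_distances(profiles, groups):
    """Map each group label to the pairwise distances among its samples."""
    by_group = {}
    for label in set(groups):
        members = [p for p, g in zip(profiles, groups) if g == label]
        by_group[label] = [bray_curtis(a, b) for a, b in combinations(members, 2)]
    return by_group


# Hypothetical abundance profiles (three taxa per sample), not study data:
# identical communities at 0h, fully diverged communities at 24h.
profiles = [
    [10, 10, 10], [10, 10, 10], [10, 10, 10],
    [30, 0, 0], [0, 30, 0], [0, 0, 30],
]
groups = ["0h", "0h", "0h", "24h", "24h", "24h"]
distances = within_group_distances(profiles, groups)
mean_by_group = {label: mean(values) for label, values in distances.items()}
```

Because pairwise distances within a group share samples, they are not independent observations, so any ANOVA on them should be read as approximate; a permutation-based test is usually the safer choice.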

Adding a color legend to the chart would vastly improve Figure 4. Also, if the regions of the bar charts correspond to the scatter plot, how come I see a lot of orange in the bar charts but no orange points in the scatter plots?
We thank the reviewer for pointing out this error. The colors in Figure 3 had been accidentally switched so that there were more grey than orange points, not vice versa. This error has now been rectified. We have also included a color legend and the number of MAGs associated with each color bar.
14. Figure 5. It is jarring to see a phylogenetic tree that spans multiple phyla, but is rooted in the middle of the Clostridia. We know from many other studies that the true root should place the phyla into separate clades. You probably set the root to the midpoint? Suggest either (1) adding an archaeal genome as an outgroup to find an appropriate root or (2) manually setting the root to a location that places the phyla into separate clades (will be a few nodes up the tree from Bifidobacteria).
We thank the reviewer for the suggestion. We have added Methanobrevibacter smithii as an outgroup and rooted the tree. The phylogenetic tree is now arranged more clearly by phylum. This is now Figure 4.

Reviewer #2 (Remarks to the Author): In the manuscript, "Linking carbohydrate structure with function in the human gut microbiome using hybrid metagenome assemblies", the authors have performed fermentation studies with human stool samples in the presence of different types of complex sugars. While this is an excellent study, my major concern is that the authors' claims about the changes of gut bacteria abundances over time seem to be performed in only one sample.
We thank the reviewer for their positive assessment of the manuscript, and we address their comments below.

Since we know that there are several different microbial compositions of a "healthy gut", it makes this author wonder if these results are repeatable with stool samples from different donors. I would suggest that the authors consider either altering their paper to focus on the new species and genera observed and their gene compositions in relation to the treatment OR repeat this experiment in 2 additional donors.
We agree with the reviewer and would like to refer the reviewer to the changes made in response to reviewer #1, who has raised a similar point. In response to this, we have significantly rewritten both the results and discussion in order to provide a greater focus on the gene compositions of the new genera observed in relation to the treatments which have been carried out. We refer in particular to our response to points 1, 3 and 4 raised by reviewer #1.

Minor Concern
The authors have missing references on lines 122 and 149.
This was due to an error introduced by a hyperlink in the document and has now been rectified.

Reviewers' comments:
Reviewer #1 (Remarks to the Author): The researchers used an in vitro fermentation system to incubate a human fecal sample with different carbohydrate substrates. They carried out deep shotgun sequencing with long and short read platforms, which allowed them to construct high-quality metagenome-assembled genomes (MAGs). They then carried out an analysis of carbohydrate-active genes in the MAGs.
The authors have conducted a solid experiment, have dedicated substantial resources to long-read sequencing, and have done an excellent job of assembling and presenting the MAGs. However, the analysis falls short in terms of quantitatively connecting the new genomes to carbohydrate utilization and does not take full advantage of the opportunities to build relevant knowledge from a data set with so much potential. It is my sincere hope that the authors will be willing to take the analysis a few steps further and substantially increase the impact of their work.
Major comments
1. The analysis of CAZyme families, which might be the *starting point* for linking carbohydrate structure with function, is limited to only a single paragraph at the end of the results. This paper was just heating up when you ended it! Once you have the CAZyme annotations, there are so many great options for additional analysis.
a. How are the GH families correlated with the carbohydrate source across all of your genomes? How does this meet or differ from expectations?
b. Why do the majority of MAGs NOT change strongly with the carbohydrate source? If GH diversity is higher in a genome, is it able to utilize more carbohydrates from each source, and thus differ less from one source to another?
c. Quantitatively, how do the CAZyme profiles in your MAGs compare to those in existing reference genomes?
d. Are other genes (not in the CAZyme DB) correlated with the carbohydrate source?
e. Can you use your results to predict how a new or hypothetical MAG will respond to a change in carbohydrate source?
f. From previous publications, can you look up the relative amount of various carbohydrate linkages in your sources and relate this quantitatively to the GHs detected in your MAGs?
g. What fraction of carbohydrate-active genes were not accounted for by your MAGs?
Rather than a laundry list of things to do, these ideas might provide loose direction as the authors think carefully about how to take better advantage of the foundation that has been laid in their results so far.
2. The authors refrain from statistical analysis, which is understandable in light of not having replicates for each carbohydrate source. Despite this limitation, there are quantitative and sometimes even statistical approaches that should be employed to reinforce the points made in the paper. For example, "the 0h profiles clustered closely together" (line 138) is testable by permutations (PERMANOVA test of distance). In the final subsection, statements about GH gene abundance should be quantitative, and may be testable depending on how you ask the question. In each subsection, I urge the authors to think about how the results could be more quantitative, and if the question might be posed in a way that a statistical comparison is appropriate.
3. In many ways, the results dwell too much on the construction and characterization of the MAGs, instead of focusing on what we might learn from them. We get an entire subsection on hybrid vs. short-read-only assemblies, but it seems obvious that hybrid assemblies will kick the pants off short-read-only methods. Then, we have another subsection on taxonomic annotation of MAGs. My advice would be to shorten the results here, so we can get to learning what the MAGs have to teach us.

Response to reviewers' comments for manuscript COMMSBIO-21-2064A
Reviewer #1 (Remarks to the Author): The researchers used an in vitro fermentation system to incubate a human fecal sample with different carbohydrate substrates. They carried out deep shotgun sequencing with long and short read platforms, which allowed them to construct high-quality metagenome-assembled genomes (MAGs). They then carried out an analysis of carbohydrate-active genes in the MAGs.
The authors have conducted a solid experiment, have dedicated substantial resources to long-read sequencing, and have done an excellent job of assembling and presenting the MAGs. However, the analysis falls short in terms of quantitatively connecting the new genomes to carbohydrate utilization and does not take full advantage of the opportunities to build relevant knowledge from a data set with so much potential. It is my sincere hope that the authors will be willing to take the analysis a few steps further and substantially increase the impact of their work.
We thank the reviewer for their positive and constructive comments regarding our manuscript. In response to the reviewer's suggestions, we have added several additional steps to our analysis which we believe allow us to obtain greater insights from our data.

Rather than a laundry list of things to do, these ideas might provide loose direction as the authors think carefully about how to take better advantage of the foundation that has been laid in their results so far.
In response to this comment, we have added several further analyses which provide more insight into our data. As suggested by the reviewer, we have not addressed this comment point-by-point, but rather have selected additional analyses based on the points raised which we believe to significantly extend the impact of the work.
In response to point 1d, we have included an analysis using the HUMAnN3 pipeline, which calculates gene abundances, and related this quantitatively to substrate composition by analysing fold changes in gene abundance. We have included the following text to describe this in the methods (Line 587-593):
In response to several of the points raised by the reviewer, specifically 1a, 1b and 1f, we have included a more quantitative analysis of the CAZyme results to relate them to the species abundance for each of the substrates. Additionally, we have narrowed the focus to extracellular secretory enzymes likely to be involved in degradation of large polymeric substrates. This has yielded interesting additional results highlighting CBM74, a recently discovered raw starch binding domain, as a factor which is present in several of the not previously isolated MAGs linked to recalcitrant starch degradation. We have included supplementary figures 10 and 11, and